Septic shock is one of the most common adverse events to occur to patients in hospitals (1). This disease, resultant from infection, is often deadly, and is prevalent in both developing and developed countries. Due to a variety of patient-specific factors such as antibiotic resistance, sepsis can be difficult to treat. This leads to the question of whether gene expression differences can be seen in patients that respond well to sepsis treatments, compared to those that do not.
The dataset selected for this assignment consists of 31 sepsis patients that underwent whole blood RNA sequencing. Of these 31 patients, 17 responded to treatment (R), and 14 did not (NR). The sequencing was performed at two time points: upon ICU admission (T1), and 48h after ICU admission (T2). After being admitted to the ICU, the patients received hemodynamic therapy as treatment for sepsis (2).
The experimental conditions of interest are whether the patients are Responders or Non-Responders, as we are interested in evaluating gene expression differences and how different treatments may be appropriate based on a patient’s gene expression.
Figure 1: Density plots of normalized expression of all patients at T1 and T2.
Read in normalized data:
tp1 <- read.csv("normalizedCountsTimePoint1.csv")
tp2 <- read.csv("normalizedCountsTimePoint2.csv")
rownames(tp1) <- tp1$X
tp1$X <- NULL
rownames(tp2) <- tp2$X
tp2$X <- NULL
Create the differential expression model:
model_design1 <- model.matrix(~ expGroups1$response)
exprMat1 <- as.matrix(tp1)
rownames(exprMat1) <- rownames(tp1)
colnames(exprMat1) <- colnames(tp1)
minSet1 <- ExpressionSet(assayData=exprMat1)
model_design2 <- model.matrix(~ expGroups2$response)
exprMat2 <- as.matrix(tp2)
rownames(exprMat2) <- rownames(tp2)
colnames(exprMat2) <- colnames(tp2)
minSet2 <- ExpressionSet(assayData=exprMat2)
fit1 <- limma::lmFit(minSet1, model_design1)
fit2 <- limma::lmFit(minSet2, model_design2)
bayes1 <- eBayes(fit1, trend=TRUE)
bayes2 <- eBayes(fit2, trend=TRUE)
topfit1 <- topTable(bayes1,
coef=ncol(model_design1),
adjust.method = "BH",
number = nrow(exprMat1))
output_hits1 <- merge(rownames(tp1),
topfit1,
by.y=0,by.x=1,
all.y=TRUE)
#sort by pvalue
output_hits1 <- output_hits1[order(output_hits1$P.Value),]
knitr::kable(output_hits1[1:10,],type="html",row.names = FALSE)
| x | logFC | AveExpr | t | P.Value | adj.P.Val | B |
|---|---|---|---|---|---|---|
| LOC220729 | -1.0465329 | 4.037170 | -3.523507 | 0.0013577 | 0.9999999 | -4.587982 |
| SCARNA9 | -20.4659413 | 44.617945 | -3.497883 | 0.0014541 | 0.9999999 | -4.588065 |
| SCARNA9L | -20.4659413 | 44.617945 | -3.497883 | 0.0014541 | 0.9999999 | -4.588065 |
| LOC100422687 | -0.8252775 | 1.437967 | -3.494073 | 0.0014689 | 0.9999999 | -4.588077 |
| DHCR24 | -4.5308751 | 6.176493 | -3.276204 | 0.0026131 | 0.9999999 | -4.588787 |
| TTC12 | -0.9090289 | 1.907452 | -3.226275 | 0.0029764 | 0.9999999 | -4.588949 |
| LUC7L | -1.7024605 | 7.035221 | -3.188485 | 0.0032830 | 0.9999999 | -4.589073 |
| ZNF558 | -2.7995957 | 6.853006 | -3.185833 | 0.0033056 | 0.9999999 | -4.589081 |
| ZNF492 | 0.9136145 | 1.588075 | 3.039347 | 0.0048135 | 0.9999999 | -4.589559 |
| NFKB1 | 33.4942228 | 153.273495 | 2.972237 | 0.0057046 | 0.9999999 | -4.589778 |
topfit2 <- topTable(bayes2,
coef=ncol(model_design1),
adjust.method = "BH",
number = nrow(exprMat2))
output_hits2 <- merge(rownames(tp2),
topfit2,
by.y=0,by.x=1,
all.y=TRUE)
#sort by pvalue
output_hits2 <- output_hits2[order(output_hits2$P.Value),]
knitr::kable(output_hits2[1:10,],type="html",row.names = FALSE)
| x | logFC | AveExpr | t | P.Value | adj.P.Val | B |
|---|---|---|---|---|---|---|
| RAP2C-AS1 | -0.7349751 | 1.284197 | -5.214165 | 0.0000121 | 0.1455996 | 1.1430711 |
| PER1 | -4.4860643 | 4.505854 | -4.750732 | 0.0000453 | 0.2138933 | 0.3979416 |
| GPR18 | 1.9377854 | 2.782914 | 4.692489 | 0.0000535 | 0.2138933 | 0.3030249 |
| LTB | 47.1705702 | 100.863859 | 4.285493 | 0.0001685 | 0.2197084 | -0.3653735 |
| ATG2B | -13.5570936 | 74.737311 | -4.188166 | 0.0002212 | 0.2197084 | -0.5260223 |
| ARHGAP29 | -3.9317575 | 4.676844 | -4.176950 | 0.0002283 | 0.2197084 | -0.5445428 |
| ZNF74 | 1.3868168 | 2.157942 | 4.141617 | 0.0002519 | 0.2197084 | -0.6028913 |
| ZBTB16 | -8.1381353 | 8.186483 | -4.129646 | 0.0002604 | 0.2197084 | -0.6226608 |
| ARG1 | -360.5169794 | 399.496637 | -4.076849 | 0.0003016 | 0.2197084 | -0.7098509 |
| GPCPD1 | -106.3618948 | 269.094891 | -4.050477 | 0.0003245 | 0.2197084 | -0.7533959 |
After calculating p-values for the genes in my expression set, I found that there were 379 DE genes in the T1 data, and 1774 DE genes in the T1 data. This is using a p-value of 0.05, which was selected because it allowed for a reasonable number of genes to be “significant” at each time point. A more lenient threshold would result in too many genes (statistically less rigorous, and difficult to work with), and a more stringent threshold may result in fewer significant pathways being found downstream.
To correct my p-values, I used the Benjamini-Hochberg method, as this was what was used by the authors of the original paper as well. This method is known to be less stringent than the Bonferroni method, which I think is desirable for this dataset as the nuances between the R and NR groups may be relatively small.
length(which(output_hits1$P.Value < 0.05))
## [1] 244
length(which(output_hits2$P.Value < 0.05))
## [1] 1774
length(which(output_hits1$adj.P.Val < 0.05))
## [1] 0
length(which(output_hits2$adj.P.Val < 0.05))
## [1] 0
Unfortunately, after correcting the P-values, none of the genes pass correction.
tp1MA <- tp1 %>%
rownames_to_column("gene") %>%
pivot_longer(cols=colnames(tp1), names_to="name", values_to="expr")%>%
merge(expGroups1, by="name") %>%
merge(output_hits1, by.x="gene", by.y="x") %>%
mutate(Signif=if_else(P.Value < 0.05, "Sig", "Insig")) %>%
mutate(Reg = case_when((P.Value < 0.05 & logFC < 0)~"Down",
(P.Value < 0.05 & logFC > 0)~"Up",
(P.Value > 0.05 & logFC > 0)~"Not Significant",
(P.Value > 0.05 & logFC < 0)~"Not Significant"))
ggplot(as.data.frame(tp1MA), aes(x=logFC, y=-log10(P.Value), colour=as.factor(Reg)))+
geom_point()+
ggtitle("Volcano Plot for T1 Expression")
Figure 2: Volcano plots of DE genes
tp2MA <- tp2 %>%
rownames_to_column("gene") %>%
pivot_longer(cols=colnames(tp2), names_to="name", values_to="expr")%>%
merge(expGroups2, by="name") %>%
merge(output_hits2, by.x="gene", by.y="x") %>%
mutate(Signif=if_else(P.Value < 0.05, "Sig", "Insig")) %>%
mutate(Reg = case_when((P.Value < 0.05 & logFC < 0)~"Down",
(P.Value < 0.05 & logFC > 0)~"Up",
(P.Value > 0.05 & logFC > 0)~"Not Significant",
(P.Value > 0.05 & logFC < 0)~"Not Significant"))
ggplot(as.data.frame(tp2MA), aes(x=logFC, y=-log10(P.Value), colour=as.factor(Reg)))+
geom_point()+
ggtitle("Volcano Plot for T2 Expression")
. Heatmaps of top hits:
suppressPackageStartupMessages(library(ComplexHeatmap))
signifGenes <- output_hits1$x[which(output_hits1$P.Value < 0.05)]
signifExpr1 <- tp1[which(rownames(tp1) %in% signifGenes),]
signifExpr1 <- t(scale(t(signifExpr1)))
dend1 = cluster_between_groups(signifExpr1, expGroups1$response)
dend_col = c("R" = 3, "NR" = 2)
hm1 <- ComplexHeatmap::Heatmap(as.matrix(signifExpr1),
cluster_columns=dend1,
top_annotation=
HeatmapAnnotation(response=as.factor(expGroups1$response), col = list(response = dend_col)),
row_names_gp = grid::gpar(fontsize = 3),
column_names_gp = grid::gpar(fontsize = 5),
name="Before treatment")
signifGenes2 <- output_hits2$x[which(output_hits2$P.Value < 0.05)]
signifExpr2 <- tp2[which(rownames(tp2) %in% signifGenes2),]
signifExpr2 <- t(scale(t(signifExpr2)))
dend2 = cluster_between_groups(signifExpr2, expGroups2$response)
dend_col = c("R" = 3, "NR" = 2)
hm2 <- ComplexHeatmap::Heatmap(as.matrix(signifExpr2),
cluster_columns=dend2,
top_annotation=
HeatmapAnnotation(response=as.factor(expGroups2$response), col = list(response = dend_col)),
row_names_gp = grid::gpar(fontsize = 3),
column_names_gp = grid::gpar(fontsize = 5),
name="After treatment")
hm1
Figure 3A: Heatmaps of expression values for DE genes across all patients at T1
hm2
Figure 3B: Heatmaps of expression values for DE genes across all patients at T2
In the heatmaps above, I deliberately clustered together the experimental groups (R and NR) in order to see the similarities within the groups. However, Since there does seem to be more intra-group similarity than inter-group similarity, it is likely that the experiment groups would cluster together without deliberately specifying so.
Create lists of up and downregulated genes:
tp1Sig <- output_hits1[which(output_hits1$P.Value < 0.05),]
tp1Upreg <- tp1Sig[which(tp1Sig$logFC > 0),]
tp1Downreg <- tp1Sig[which(tp1Sig$logFC < 0),]
tp2Sig <- output_hits1[which(output_hits2$P.Value < 0.05),]
tp2Upreg <- tp2Sig[which(tp2Sig$logFC > 0),]
tp2Downreg <- tp2Sig[which(tp2Sig$logFC < 0),]
Use g:profiler for ORA:
suppressPackageStartupMessages(library(gprofiler2))
gostrestp1 <- gost(tp1Sig$x, sources=c("GO:BP", "GO:MF", "GO:CC", "KEGG"))
gostrestp1Up <- gost(tp1Upreg$x, sources=c("GO:BP", "GO:MF", "GO:CC", "KEGG"))
gostrestp1Down <- gost(tp1Downreg$x, sources=c("GO:BP", "GO:MF", "GO:CC", "KEGG"))
gostrestp2 <- gost(tp2Sig$x, sources=c("GO:BP", "GO:MF", "GO:CC", "KEGG"))
gostrestp2Up <- gost(tp2Upreg$x, sources=c("GO:BP", "GO:MF", "GO:CC", "KEGG"))
gostrestp2Down <- gost(tp2Downreg$x, sources=c("GO:BP", "GO:MF", "GO:CC", "KEGG"))
gostplot(gostrestp1)
Figure 4A: Manhattan Plot of ORA results for all DE genes at T1
gostplot(gostrestp2)
Figure 4B: Manhattan Plot of ORA results for all DE genes at T2
gostplot(gostrestp1Up)
Figure 4C: Manhattan Plots of ORA results for upregulated genes at T1
gostplot(gostrestp1Down)
Figure 4D: Manhattan Plots of ORA results for downregulated genes at T1
gostplot(gostrestp2Up)
Figure 4E: Manhattan Plots of ORA results for upregulated genes at T2
gostplot(gostrestp2Down)
Figure 4F: Manhattan Plots of ORA results for downregulated genes at T2
gostrestp1 <- gost(tp1Sig$x,
sources=c("GO:BP", "GO:MF", "GO:CC", "KEGG"),
user_threshold = 0.001)
gostrestp2 <- gost(tp2Sig$x,
sources=c("GO:BP", "GO:MF", "GO:CC", "KEGG"),
user_threshold = 0.001)
gostplot(gostrestp1)
Figure 4G: Manhattan Plots of ORA results for T1, using threshold of 0.001
gostplot(gostrestp2)
Figure 4H: Manhattan Plots of ORA results T2, using threshold of 0.001
For my over representation analysis, I chose to use the gprofiler2 R package. I found g:Profiler very convenient to use in the journal assignment. As well, I thought that g:profiler was extremely thorough as it allows your to compare your gene list to multiple different databases at once. You can also adjust your significance thresholds.
The annotation sources used in this analysis are GO: Molecular Functions, GO: Cellular Components, GO: Biological Pathways, and KEGG. These were selected because the authors of the original paper used a tool called ClueGO, which is a plugin for Cytoscape (Bindea et al., 2009). Since Cytoscape was not used in my assignment, I instead used the datasets that make up ClueGO directly, which are GO and KEGG. I believe g:profiler uses the most up to date versions of GO and KEGG, as the g:profiler website was last updated in February of this year. The R package is supposed to return the same results as using the website, so the results above should also be using the most recent versions of the annotation sources.
With the default thresholding values from gost (p < 0.05), 49 pathways were returned for the T1 gene list, and 278 pathways were returned for the T2 gene list. The authors used a thresholding value of 0.001. When this was applied in my assignment, 8 pathways were returned for the T1 gene list, and 166 pathways were returned for the T2 gene list. Albeit, this threshold was used on a slightly different method (ClueGO from Cytoscape), and the authors had different upstream pre-processing than what was done in this assignment.
For each of the two time points, the ORA was run again for the up- and downregulated gene sets. At T1, none of the pathways were above the threshold when considering the entire set of DE genes. The same is true when only considering up- and downregulated genes individually. At T2, when considering the entire gene list, there are many pathways that are above the threshold. However, the downregulated genes have much fewer pathways that pass the threshold. There are only two, which are protein binding and cytosol related pathways. The pathways that pass the threshold when only considering the up- or downregulated genes seem to be subsets of the pathways that pass the threshold when considering all genes. The results may be different in these cases because the background genes that are being considered are different in each set.
As well, the authors specified a slightly different model than mine. They split the data into two groups based on response to treatment (R vs NR), and specified the model (counts ~ patient + timepoint) for each group. In my analysis, I split the data into two groups based on time point, and specified the model (counts ~ response) for each timepoint. Although it is difficult to compare the two approaches, it is interesting to note that both my analysis and the paper found more significantly up/downregulated pathways at the second time point. Moreover, the heatmap plotted by the authors for the DE genes they found followed an extremely similar pattern as my heatmaps. This suggests that although the two analyses had different results, they may have found the same overall patterns in the data.
However, one issue with my analysis is that the pathways that it found to be significant are not the same as those found by the authors. The authors found immune-related pathways to be significant, which were not found in my analyses. The pathways found to be significant in my assignment were related to cellular metabolic processes and cell death. Again, this may be due to their use of Cytoscape, as their method seems to perform grouping of terms, which was not done in my assignment. It is also possible that the GO and KEGG datasets have been updated since the paper was published.
One area of interest in this assignment is the specification of the linear model. Since I was interested in exploring the differences in expression between the R and NR groups, I split the dataset into two based on the time point at which the data was collected. Then, I specified my model as (counts ~ response). The patient variable was not included in the model. This is because including the patient variable in the linear model from limma resulted in NA coefficients for one of the variables in the model. This suggests that the response variable may be collinear with the patient variable, and thus would be redundant to include both.